Assembler Assignment

Write an assembler for the SIM5 assembly language in C. In other words, if the input to your C program is:

    main: ld a
          mov (%r0),%r2
          out %r2
          halt
    a:    .word main
          .end main

your C program should output the machine language program:

    000 1004
    001 9102
    002 8102
    003 0000
    004 0000
    000

When executed by your SIM5 simulator, the machine language program should output 0000 (the contents of a).

The program must be completed in stages, and each stage must be handed in before you receive any help (or any credit) for subsequent stages. The four stages for the first week are as follows:

1. Write C code that inputs a "free format" assembly language program and prints a formatted version with labels starting in column 1, opcodes and assembly directives starting in column 11, and operands starting in column 21. For example:

    Input                 Output
    main: ld a            main:     ld        a
    mov (%r0),%r2                   mov       (%r0),%r2
    out %r2                         out       %r2
    halt                            halt
    alta:                 alta:
    a: .word main         a:        .word     main
    .end main                       .end      main

   To simplify the assignment, you may assume there are no comments. However, there are some other questions about the syntax of your assembler that you must answer (and the answers need to be in your documentation):

   1. Are blank lines allowed?
   2. Can opcodes and assembly directives begin in column 1?
   3. Can an address (004 in this example) have two or more names (alta and a), and can labels appear on lines by themselves?

2. Modify program (1) so that all lines are read before any line is printed. This is a big deal because you are deciding how you are going to store the program, and this affects all of your subsequent code. For example:

   a. You could use multidimensional arrays such as label[1200][11], opcode[1200][11] and operand[1200][11] to store the three fields. (You could also split opcode[1200][11] into opcode[1200][11] and directive[1200][11].) See Section 5.7 (multidimensional arrays) in K&R.

   b. Instead of using multidimensional arrays, you could store the assembly program as an array of structures (see Chapter 6 in K&R). For example:

          struct line {
              char label[11];
              char opcode[11];
              char operand[11];
          } program[1200];

          strcpy(program[0].label, "main:");
          strcpy(program[0].opcode, "ld");
          strcpy(program[0].operand, "a");

      (The fields are character arrays, so they must be filled with strcpy(); a string cannot be assigned to an array with "=".)

3. Print out the symbol table. For the program above, your C code should print:

    main 0000
    alta 0004
    a    0004

   You should create functions such as "int addsym(int addr, char *label)" and "int symval(char *label)" to manipulate your symbol table. Make sure you catch all of the errors such as "duplicate symbol", "illegal label" or "label not found".

Some Hints.

1. You could use matrices to store character strings. For example, use opcode[30][11] to store the 30 opcodes and assembly directives (each of which can contain up to 10 characters, plus the terminating '\0'). The line

       char opcode[30][11] = {"halt", "ld", "st", "add", "sub", "lda", ".word", ...};

   will initialize the array (see page 113 in K&R), and "printf("%s\n", opcode[3])" will print "add".

2. If you pass matrices to functions, you must specify the size of the second subscript when you declare the function's parameter: function(char opcode[30][11]) or function(char opcode[][11]), but not function(char opcode[][]). See page 112 in K&R. It may be easier to define such arrays as "external" by declaring them outside all functions. See page 31 in K&R.

3. Use code from the book such as getline(char s[], int lim) on page 29 and strlen(char s[]) (but don't borrow code from your neighbor). If you store all of the tokens (symbolic addresses, opcodes, operands) as strings of at most 10 characters, you can safely use copy(char to[], char from[]) on page 29. (Note that "copy(opcode[3], opcode[5])" is legal in C and would make opcode[3] a copy of opcode[5].) strlen(opcode[3]) (K&R, page 39) returns the length of the 4th symbolic opcode.

4. Use code from the C library. See pages 241 to 250 in K&R.

5. The AT&T assembler uses "#" as the comment character rather than ";". When you encounter the symbol "#" in an assembly program, ignore all characters beginning with the comment character up to (but not including) the next "\n" character.

6. You must parse each line into 0 to 3 tokens (label or symbolic address, opcode or assembly directive, operands). This is very easy if you steal code from function readcode() in sim1.c (use %s instead of %d in sscanf()). Do you understand why sscanf() requires the "&" symbol?

7. If the first token ends with ":", it is a label defining a symbolic address; otherwise it must be an opcode or assembly directive.

8. Think before you code. Because you must build the symbol table, an assembler requires at least two "passes".

9. Design your program as a simple main program that simply calls functions that you write and debug separately. You need to write a separate main program just to test each function. If you bring the lab instructors a program with untested functions, they have been directed not to try to debug your code.

10. Your choice of functions makes the assignment easy or impossible. A function to turn a symbolic opcode into a numerical opcode? A function to turn a symbolic address into a numerical address? A function to add a symbolic address to the symbol table?

11. Ditto for your choice of data structures and external (global) variables.

12. If you have a string (say, token) that might or might not be an integer, the statement "count = sscanf(token, "%d", &value);" will set count to 1 if the string begins with an integer (and place the number in value), and to 0 otherwise.

13. For a perfect program, you should handle the "dot" directives like ".=100" or ".=.+50". However, this requires a bit of a hack, so I suggest solving the assignment without these statements first and then adding them if you have time.
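
As a rough illustration of hints 5 through 10 and of stage 3, here is one way the pieces might fit together. It is only a sketch under stated assumptions, not the required design. The signatures of addsym() and symval() come from stage 3; everything else (split_line(), format_line(), pass1(), the constants MAXSYM and TOKLEN, and the negative error codes) is a hypothetical name chosen for this example. The sketch also assumes that every opcode and every .word occupies exactly one word and that .end occupies none, which is what the sample machine code above suggests.

```c
#include <stdio.h>
#include <string.h>

#define MAXSYM 1200  /* assumed capacity, mirroring the 1200-line arrays */
#define TOKLEN 11    /* 10 characters plus the terminating '\0' */

static char symname[MAXSYM][TOKLEN];
static int symaddr[MAXSYM];
static int nsyms = 0;

/* Return the address bound to label, or -1 for "label not found". */
int symval(char *label)
{
    int i;

    for (i = 0; i < nsyms; i++)
        if (strcmp(symname[i], label) == 0)
            return symaddr[i];
    return -1;
}

/* Bind label to addr. Returns 0 on success, -1 for an illegal (empty or
   over-long) label, -2 for a duplicate symbol, -3 if the table is full. */
int addsym(int addr, char *label)
{
    size_t len = strlen(label);

    if (len == 0 || len >= TOKLEN)
        return -1;
    if (symval(label) != -1)
        return -2;
    if (nsyms == MAXSYM)
        return -3;
    strcpy(symname[nsyms], label);
    symaddr[nsyms++] = addr;
    return 0;
}

/* Split one source line into label, opcode and operand buffers of TOKLEN
   bytes each. Absent fields become "". Returns the number of tokens. */
int split_line(const char *src, char *label, char *opcode, char *operand)
{
    char buf[256], tok[3][TOKLEN], *hash;
    int n, first = 0;
    size_t len;

    strncpy(buf, src, sizeof buf - 1);
    buf[sizeof buf - 1] = '\0';
    if ((hash = strchr(buf, '#')) != NULL)
        *hash = '\0';                     /* ignore a "#" comment */
    label[0] = opcode[0] = operand[0] = '\0';
    n = sscanf(buf, "%10s %10s %10s", tok[0], tok[1], tok[2]);
    if (n <= 0)
        return 0;                         /* blank or comment-only line */
    len = strlen(tok[0]);
    if (tok[0][len - 1] == ':') {         /* trailing ':' marks a label */
        tok[0][len - 1] = '\0';
        strcpy(label, tok[0]);
        first = 1;
    }
    if (first < n)
        strcpy(opcode, tok[first]);
    if (first + 1 < n)
        strcpy(operand, tok[first + 1]);
    return n;
}

/* Write the line with the label in column 1, opcode in 11, operand in 21. */
void format_line(char *out, const char *label, const char *opcode,
                 const char *operand)
{
    char lab[TOKLEN + 1] = "";

    if (label[0] != '\0')
        sprintf(lab, "%s:", label);
    sprintf(out, "%-10s%-10s%s", lab, opcode, operand);
}

/* Pass 1: record every label at the current address. ASSUMPTION: each
   opcode or .word takes one word, and .end takes none. */
int pass1(const char *lines[], int nlines)
{
    char label[TOKLEN], opcode[TOKLEN], operand[TOKLEN];
    int i, rc, addr = 0;

    for (i = 0; i < nlines; i++) {
        split_line(lines[i], label, opcode, operand);
        if (label[0] != '\0' && (rc = addsym(addr, label)) != 0)
            return rc;
        if (opcode[0] != '\0' && strcmp(opcode, ".end") != 0)
            addr++;
    }
    return 0;
}
```

Run over the seven-line example from stage 1, pass1() would bind main to address 0 and both alta and a to address 4, because a label on a line by itself does not advance the address. Each function is small enough to be tested from its own throwaway main program, as hint 9 asks. Note that split_line() stores the label without its trailing ":", so format_line() puts the colon back when printing.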